Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

Authors

Abstract

We introduce a high performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements this architecture, show considerable performance gains for the kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest though still relevant.
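By way of illustration, the following is a minimal sketch of the kind of half-precision inner kernel the abstract describes: an 8x1 gemm micro-kernel built on the ARMv8.2-A NEON FP16 extension. The micro-tile shape, packing scheme, and names are assumptions for this sketch, not the authors' implementation.

```c
/* Minimal FP16 gemm micro-kernel sketch for ARMv8.2-A NEON.
 * Illustrative only; the 8x1 micro-tile and packing are assumptions.
 * Compile with e.g.: gcc -O3 -march=armv8.2-a+fp16 ...
 */
#include <arm_neon.h>

/* C[8x1] += A[8xk] * B[kx1]; A is packed so column p starts at A + 8*p. */
static void micro_kernel_8x1_fp16(int k,
                                  const float16_t *A,   /* packed 8 x k  */
                                  const float16_t *B,   /* packed k x 1  */
                                  float16_t *C)         /* 8 x 1 tile    */
{
    float16x8_t c0 = vld1q_f16(C);               /* load accumulator   */
    for (int p = 0; p < k; ++p) {
        float16x8_t a = vld1q_f16(A + 8 * p);    /* 8 rows of column p */
        float16x8_t b = vdupq_n_f16(B[p]);       /* broadcast B(p,0)   */
        c0 = vfmaq_f16(c0, a, b);                /* fused multiply-add */
    }
    vst1q_f16(C, c0);
}
```

In a full, multi-threaded gemm such a micro-kernel would sit inside the usual blocked loops, with threads partitioning the independent micro-tiles of C.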


Related articles

Sparse Matrix-Vector Multiplication on NVIDIA GPU

In this paper, we present our work on developing a new matrix format and a new sparse matrix-vector multiplication algorithm. The matrix format, HEC, is a hybrid format that is efficient for sparse matrix-vector multiplication and friendly to preconditioners. Numerical experiments show that our sparse matrix-vector multiplication algorithm is efficient on...
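The exact HEC layout is that paper's own; hybrid formats commonly pair a regular part with a CSR remainder, so as context, only a generic CSR y = Ax loop is sketched below. The struct and field names are assumptions.

```c
/* Generic CSR sparse matrix-vector product, shown as context for
 * hybrid formats like HEC; not the paper's implementation. */
typedef struct {
    int n;              /* number of rows */
    const int *rowptr;  /* size n+1       */
    const int *colind;  /* size nnz       */
    const double *val;  /* size nnz       */
} csr_t;

static void spmv_csr(const csr_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; ++i) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j)
            sum += A->val[j] * x[A->colind[j]];
        y[i] = sum;
    }
}
```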


Low precision storage for deep learning

Multipliers are the most space- and power-hungry arithmetic operators in digital implementations of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those f...
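As context for the formats compared there: a fixed-point number stores x ≈ q · 2^(-f) for an integer q and a fixed number of fractional bits f, while dynamic fixed point lets each group of variables (e.g., one layer's weights) carry its own f. A hedged round-and-saturate sketch, with names assumed:

```c
/* Fixed-point conversion sketch: 16-bit signed, f fractional bits.
 * Illustrative only; not the paper's training pipeline. */
#include <math.h>
#include <stdint.h>

static int16_t to_fixed(float v, int f)
{
    float scaled = v * (float)(1 << f);
    if (scaled >  32767.0f) scaled =  32767.0f;   /* saturate high */
    if (scaled < -32768.0f) scaled = -32768.0f;   /* saturate low  */
    return (int16_t)lrintf(scaled);               /* round to nearest */
}

static float from_fixed(int16_t q, int f)
{
    return (float)q / (float)(1 << f);
}
```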


Strassen's matrix multiplication for customisable processors

Strassen's algorithm is an efficient method for multiplying large matrices. We explore various ways of mapping Strassen's algorithm into reconfigurable hardware that contains one or more customisable instruction processors. Our approach has been implemented using Nios processors with custom instructions and with custom-designed coprocessors, taking advantage of the additional logic and memory ...
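For reference, Strassen's method multiplies a 2x2 block partition using seven block products instead of eight, which is what makes recursive mappings onto hardware attractive. The standard scheme is:

```latex
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_2 &= (A_{21}+A_{22})\,B_{11}\\
M_3 &= A_{11}(B_{12}-B_{22})          & M_4 &= A_{22}(B_{21}-B_{11})\\
M_5 &= (A_{11}+A_{12})\,B_{22}        & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22})\\[4pt]
C_{11} &= M_1+M_4-M_5+M_7 & C_{12} &= M_3+M_5\\
C_{21} &= M_2+M_4         & C_{22} &= M_1-M_2+M_3+M_6
\end{aligned}
```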


Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs

We discuss implementing blocked sparse matrix-vector multiplication for NVIDIA GPUs. We outline an algorithm and various optimizations, and identify potential future improvements and challenging tasks. In comparison with a previously published implementation, ours is faster on matrices having many high fill-ratio blocks but slower on matrices with a low number of non-zero elements per...
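For context, a blocked (BSR-style) y = Ax kernel with fixed 2x2 blocks has the access pattern sketched below; the paper's GPU implementation differs in how work is parallelized, but the per-block arithmetic is the same. The block size and names are assumptions.

```c
/* Block-CSR (BSR) sparse matrix-vector product with 2x2 blocks,
 * shown sequentially for clarity; not the paper's GPU kernel. */
static void spmv_bsr_2x2(int nblockrows,
                         const int *browptr, const int *bcolind,
                         const double *bval,   /* 4 values per block, row-major */
                         const double *x, double *y)
{
    for (int bi = 0; bi < nblockrows; ++bi) {
        double y0 = 0.0, y1 = 0.0;
        for (int j = browptr[bi]; j < browptr[bi + 1]; ++j) {
            const double *b  = bval + 4 * j;        /* 2x2 block      */
            const double *xv = x + 2 * bcolind[j];  /* matching x pair */
            y0 += b[0] * xv[0] + b[1] * xv[1];
            y1 += b[2] * xv[0] + b[3] * xv[1];
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}
```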


Matrix Multiplication on Three Heterogeneous Processors

We present a new algorithm specifically designed to perform matrix multiplication on three heterogeneous processors. This algorithm is an extension of the ‘square-corner’ algorithm designed for two-processor architectures [2]. For three processors, this algorithm partitions data in a way which, on a fully-connected network, minimizes the total volume of communication (TVC) between the processors ...
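The ‘square-corner’ name reflects the geometry: for a given area, a square sub-region minimizes the boundary data that must be communicated, so the slower processors each receive a corner square whose area is proportional to their relative speed. A hedged sketch of that sizing, with all names assumed:

```c
/* Square-corner sizing sketch for an n x n result split across three
 * processors with relative speeds s1 >= s2, s3. Illustrative only;
 * the proportional-area rule is an assumption about the scheme. */
#include <math.h>

static void square_corner_sides(double n, double s1, double s2, double s3,
                                double *side2, double *side3)
{
    double total = s1 + s2 + s3;
    *side2 = n * sqrt(s2 / total);   /* corner square for processor 2 */
    *side3 = n * sqrt(s3 / total);   /* corner square for processor 3 */
    /* processor 1 (the fastest) computes the remaining region */
}
```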



Journal

Journal title: The Journal of Supercomputing

Year: 2021

ISSN: 0920-8542, 1573-0484

DOI: https://doi.org/10.1007/s11227-021-03636-4